Keyword [MAttNet]
Liu X, Wang Z, Shao J, et al. Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing[J]. arXiv preprint arXiv:1903.00839, 2019.
1. Overview
1.1. Motivation
- Previous attention models focus only on the most dominant features of both modalities.
This paper designs a novel cross-modal attention-guided erasing approach that
- generates difficult training samples online
- makes full use of latent correspondences between training pairs
- avoids over-reliance on specific words or visual concepts
Three types of erasing:
- Image-aware query sentence erasing: replace an attended word with the "unknown" token
- Sentence-aware subject region erasing: erase the attended spatial features of the subject region
- Sentence-aware context object erasing: erase a dominant context region
1.2. Dataset
- RefCOCO
- RefCOCO+
- RefCOCOg
2. Cross-modal Attention-guided Erasing
2.1. Overview of Attention-guided Erasing
Query Sentence Erasing ($Q^*$).
Visual Erasing ($O^*$).
sample a module based on $\mathrm{Multinomial}(3, [w_{subj}, w_{loc}, w_{rel}])$; see the sketch after this list
- subject region erasing: erase attended features on the subject feature maps
- context object erasing: discard the features of a dominant context object
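A minimal sketch of the module-sampling step, assuming PyTorch; the weight values and variable names are illustrative, not from the paper's code:

```python
import torch

# Hypothetical module weights [w_subj, w_loc, w_rel] predicted by the
# language attention network (names follow MAttNet's three modules).
module_weights = torch.tensor([0.6, 0.3, 0.1])

# Draw one module index from Multinomial(module_weights); the chosen
# module determines whether the subject region or a context object
# gets erased.
module_idx = torch.multinomial(module_weights, num_samples=1).item()
erase_target = ["subject", "location-context", "relationship-context"][module_idx]
```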
Loss Function: the same ranking loss as the base model, additionally applied to the erased training pairs.
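The notes do not spell the loss out; below is a minimal sketch of a MAttNet-style max-margin ranking loss, assuming the erased pairs $(O^*, Q)$ and $(O, Q^*)$ enter the same loss as extra hard samples (this pairing and the margin value are assumptions):

```python
import torch.nn.functional as F

def rank_loss(s_pos, s_neg_obj, s_neg_query, margin=0.1):
    """Max-margin ranking loss over scalar score tensors (margin is a guess).

    s_pos:       score of the true (object, query) pair
    s_neg_obj:   score of a negative object with the true query
    s_neg_query: score of the true object with a negative query
    """
    return (F.relu(margin + s_neg_obj - s_pos)
            + F.relu(margin + s_neg_query - s_pos))

# Sketch: erased samples reuse the same loss, e.g.
# rank_loss(score(O_star, Q), ...) and rank_loss(score(O, Q_star), ...).
```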
2.2. Image-aware Query Sentence Erasing
- encode the whole image, then feed it into the LSTM so that the word attention is conditioned on the image
- sample a word from $\mathrm{Multinomial}(T, [\alpha_1, \dots, \alpha_T])$, where $\alpha_t$ is the attention weight of word $t$, and replace it with the "unknown" token
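A minimal sketch of this word erasing, assuming PyTorch; the query, attention values, and token strings are made up for illustration:

```python
import torch

# Hypothetical attention weights alpha_1..alpha_T over the T query words
# (in the paper these come from the image-aware language attention).
tokens = ["woman", "in", "red", "holding", "umbrella"]
alpha = torch.tensor([0.35, 0.05, 0.25, 0.10, 0.25])

# Sample one word index from Multinomial(alpha) and replace that word
# with the unknown token to form the erased query Q*.
t = torch.multinomial(alpha, num_samples=1).item()
erased_query = list(tokens)
erased_query[t] = "<unk>"
```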
2.3. Sentence-aware Subject Region Erasing
- $v_j$: a feature point on the subject feature map
- erase a contiguous region of size $k \times k$ ($k=3$) around the sampled feature point (see the sketch below)
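A minimal sketch of subject region erasing, assuming PyTorch and that the window center is sampled from the spatial attention distribution (the function name and shapes are illustrative):

```python
import torch

def erase_kxk(feat, attn, k=3):
    """Zero out a k x k window on a (C, H, W) subject feature map.

    feat: (C, H, W) feature map; attn: (H, W) spatial attention summing to 1.
    The window center is sampled from the attention distribution.
    """
    C, H, W = feat.shape
    idx = torch.multinomial(attn.flatten(), num_samples=1).item()
    cy, cx = divmod(idx, W)           # 2-D location of the sampled point
    half = k // 2
    y0, y1 = max(cy - half, 0), min(cy + half + 1, H)
    x0, x1 = max(cx - half, 0), min(cx + half + 1, W)
    out = feat.clone()
    out[:, y0:y1, x0:x1] = 0.0        # erase the k x k neighborhood
    return out
```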
2.4. Sentence-aware Context Object Erasing
- $c_k$: features of the $k$-th context region
- $m \in \{loc, rel\}$: context erasing applies to both the location and relationship modules
Differences from MAttNet:
- In the relationship module, MAttNet assumes only one context object contributes to recognizing the subject.
- This paper handles all context objects and attends to the important ones.
Finally, sample a context object based on $\mathrm{Multinomial}(K, [\alpha_1, \dots, \alpha_K])$ and set its features to zero (the module to erase was already chosen via $\mathrm{Multinomial}(3, [w_{subj}, w_{loc}, w_{rel}])$, as in Sec. 2.1).
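A minimal sketch of context object erasing, assuming PyTorch; the number of context objects, feature size, and attention values are made up:

```python
import torch

# Hypothetical attention weights alpha_1..alpha_K over the K context
# objects, from the location or relationship module.
context_feats = torch.randn(4, 2048)         # K=4 context region features c_k
alpha = torch.tensor([0.5, 0.2, 0.2, 0.1])   # attention over context objects

# Sample one context object from Multinomial(alpha) and zero its
# features to build the erased visual input O*.
k = torch.multinomial(alpha, num_samples=1).item()
erased_feats = context_feats.clone()
erased_feats[k] = 0.0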
2.5. Details
- Faster R-CNN with a ResNet-101 backbone to extract image features
- For each candidate object proposal, $7 \times 7$ feature maps are fed into the subject module
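A minimal sketch of pooling a $7 \times 7$ feature map per proposal, assuming torchvision's RoIAlign; the feature shapes, stride, and box coordinates are illustrative, and the exact pooling op used in the paper is not stated in these notes:

```python
import torch
from torchvision.ops import roi_align

feat_map = torch.randn(1, 1024, 38, 50)  # ResNet-101 features for one image
# Each box row is (batch_idx, x1, y1, x2, y2) in input-image coordinates.
boxes = torch.tensor([[0.0, 10.0, 12.0, 200.0, 180.0]])

# Pool a 7x7 map per proposal; spatial_scale maps image coords to the
# feature map (stride 16 assumed here).
roi_feats = roi_align(feat_map, boxes, output_size=(7, 7),
                      spatial_scale=1.0 / 16)  # -> (1, 1024, 7, 7)
```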